Goto

Collaborating Authors

 algorithm 5


Sample Complexity of Policy Gradient for Log-Growth Control

arXiv.org Machine Learning

We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/ฮท)$ when the noise density is known and $\tilde{O}(ฮท^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.


Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

arXiv.org Machine Learning

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-ฮณ)^{-1}\log((1-ฮณ)^{-1}ฮต^{-1}))$ for computing an $ฮต$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.


Penalty-Based First-Order Methods for Bilevel Optimization with Minimax and Constrained Lower-Level Problems

arXiv.org Machine Learning

We study a class of bilevel optimization problems in which both the upper- and lower-level problems have minimax structures. This setting captures a broad range of emerging applications. Despite the extensive literature on bilevel optimization and minimax optimization separately, existing methods mainly focus on bilevel optimization with lower-level minimization problems, often under strong convexity assumptions, and are not directly applicable to the minimax lower-level setting considered here. To address this gap, we develop penalty-based first-order methods for bilevel minimax optimization without requiring strong convexity of the lower-level problem. In the deterministic setting, we establish that the proposed method finds an $ฮต$-KKT point with $\tilde{O}(ฮต^{-4})$ oracle complexity. We further show that bilevel problems with convex constrained lower-level minimization can be reformulated as special cases of our framework via Lagrangian duality, leading to an $\tilde{O}(ฮต^{-4})$ complexity bound that improves upon the existing $\tilde{O}(ฮต^{-7})$ result. Finally, we extend our approach to the stochastic setting, where only stochastic gradient oracles are available, and prove that the proposed stochastic method finds a nearly $ฮต$-KKT point with $\tilde{O}(ฮต^{-9})$ oracle complexity.


facts

Neural Information Processing Systems

Let f be a non-negative submodular function on [n] that is bounded above by 1. First, suppose that Xi are monotone increasing. Construct a sequence X0i as follows. If i / I then set X0i = X0i 1. If i I then set X0i = X0i 1 (Xi \Xi 1). For the monotone decreasing case, consider the submodular function g(X) = f([n] X) and set Yi = [n] Xi.



Context-lumpable stochastic bandits

Neural Information Processing Systems

We consider a contextual bandit problem with S contexts and K actions. In each round t = 1,2,... the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into r min{S,K}groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an ฮต-optimal policy after using at most eO(r(S+K)/ฮต2) samples with high probability and provide a matching โ„ฆ(r(S + K)/ฮต2) lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time T is bounded by eO( p r3(S+K)T). To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and eO( p poly(r)(S+K)T)minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.


Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Neural Information Processing Systems

This paper focuses on supervised and unsupervised online label shift, where the class marginals Q(y) varies but the class-conditionals Q(x|y) remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrapping the estimates of online regression oracles that track the drifting proportions. Experiments across numerous simulated and real-world online label shift scenarios demonstrate the superior performance of our proposed approaches, often achieving 1-3% improvement in accuracy while being sample and computationally efficient. Code is publicly available at this url.